1 Introduction

John W. Tukey (1977) famously stated: “The greatest value of a picture is when it forces us to notice what we never expected to see” (p. vi).

Visualisation has always been an integral part of data analysis. Cook and Swayne (2007) define data visualisation as the use of graphics to “support and enrich the statistical processes of data exploration, modeling and inference” (p. 3). In addition, interactive data visualisation allows the user to interact with graphical features in order to achieve fast and flexible analysis (Unwin 2015). However, Unwin (2015) highlights that one of the challenges of interactive data visualisation is documenting the analysis. Hence the focus of this research project is on code-based software, where the code itself provides a record of the analysis.

This section examines the different types of software available and a toolbox of interactive techniques, before focusing on a set of “current” software tools for enabling commonly used interactive techniques.

The two stages of this research project reflect a learning process that students may undertake for a semester-long graduate level course using code-based, open source software for interactive data visualisation.

1.1 Method

The first stage of this research entailed a survey of interactive techniques and the range of open-source software currently available. A literature review was undertaken to identify commonly used interactive techniques and hence the techniques that should be prioritised when teaching interactive data visualisation. The survey of current software tools was then aimed towards implementing these interactive techniques and evaluating the ease with which they could be achieved. The coverage of techniques by each software tool was used to determine the set of tools to use for exploring the role of interactive visualisation in the data analysis cycle. The second stage of research applied interactive techniques to exploratory data analysis of a data set not previously examined in the literature.

1.2 Findings

Identification, filtering, different types of linked brushing and tours were found to be commonly used techniques in interactive data visualisation. Utilising the R packages plotly and crosstalk together, or in combination with shiny, provided a code-based, open-source approach towards applying these interactive techniques. Awareness of the limitations and code efficiency of each software tool enabled the efficient application of interactive techniques to exploratory data analysis of a “real” data set.

Applying interactive techniques to data analysis resulted in further insight into underlying multivariate structures. For example, tooltip identification of outliers and linked brushing of individual observations and groups of observations capitalised on the multivariate views offered by parallel coordinates plots. Furthermore, interactive filtering helped to reduce problems caused by overplotting and allowed the effect of sample size on analysis to be explored.

Interactive techniques also helped to develop a deeper understanding of abstract multivariate data analysis methods. Linked brushing between the visual projections of a guided tour and its respective projection pursuit index prompted further exploration of the multidimensional data space.

The benefits of utilising interactivity in data analysis outweighed the effort required to implement the interactive techniques. Once sufficient mastery of software was acquired, the interactive techniques could be applied to new analysis in novel ways. The ease of applying interactive techniques to gain deeper insight into analysis and a thorough understanding of underlying processes justifies teaching interactive data visualisation to graduate students.

2 A survey of interactive techniques and software

2.1 Literature review of interactive techniques

The data analysis process described by Cook and Swayne (2007) provided a structure and context for meaningful interactive data visualisation. They emphasised the role of interactive techniques in identifying the problem statement, preparing the data, enriching the exploratory and quantitative analysis, and lastly in the presentation of findings. The interactive techniques introduced by Cook and Swayne (2007) will form the basis of this section, intermixed with contributions from other literature. A set of commonly used interactive techniques will be discussed in more detail and used to evaluate the software available.

2.1.1 Different types of interactive techniques

In their worked examples, Cook and Swayne (2007) primarily used the interactive data visualisation software GGobi. Although GGobi is open-source, it is not code-based. The R packages plotly, crosstalk and shiny were used to create the examples in this section. The strengths and weaknesses of these software tools are mentioned where relevant, but are discussed in more detail and summarised in the next section (Section ??).

The crabs dataset will be used to demonstrate the interactive techniques where necessary. The data consists of information on the species (Blue or Orange), sex and five physical measurements of 200 Australian crabs: frontal lobe (FL), rear width (RW), carapace length (CL), carapace width (CW) and body depth (BD).

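Table 2.1: A subset of the crabs dataset.

| Row | species | sex | FL | RW | CL | CW | BD |
|----:|:-------:|:---:|---:|---:|---:|---:|---:|
| 1 | B | M | 8.1 | 6.7 | 16.1 | 19.0 | 7.0 |
| 2 | B | M | 8.8 | 7.7 | 18.1 | 20.8 | 7.4 |
| 51 | B | F | 7.2 | 6.5 | 14.7 | 17.1 | 6.1 |
| 52 | B | F | 9.0 | 8.5 | 19.3 | 22.7 | 7.7 |
| 101 | O | M | 9.1 | 6.9 | 16.7 | 18.6 | 7.4 |
| 102 | O | M | 10.2 | 8.2 | 20.2 | 22.2 | 9.0 |
| 151 | O | F | 10.7 | 9.7 | 21.4 | 24.0 | 9.8 |
| 152 | O | F | 11.4 | 9.2 | 21.7 | 24.1 | 9.7 |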

2.1.1.1 Linked brushing

The interactive technique of linked brushing involves using the mouse to select one or more graphical features, which then prompts related features in the same plot and/or other plots to be highlighted in the same colour (Cook and Swayne 2007). There are different types of linked brushing, resulting from the different linking rules used. The most basic form of linking is one-to-one. Figures 2.1 and 2.2 demonstrate linked brushing between individual points of a plot representing the two categorical variables in the crabs dataset and a scatterplot of the carapace length and rear width. The linking can be initiated from either plot. The brushed group of points is marked by a dashed boundary. Figure 2.2 also demonstrates persistent brushing, applied first to identify the sex and species of an unusual observation and then those of the group of smaller crabs.

Figure 2.1: The group of points representing male crabs of the Blue species is brushed to link with their carapace length and rear width measurements in the scatterplot.

Figure 2.2: An example of one-to-one linking and persistent brushing. The crab with the largest rear width measurement was first brushed, followed by the group of three smaller crabs.

Instead of linking by case, observations can also be linked by a variable. When a categorical variable is used in the linking rule, all members of the same level are highlighted once one or more members are brushed (Cook and Swayne 2007). Figure demonstrates categorical brushing being applied within one scatterplot, compared to Figure where two plots are linked to achieve the same result.
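The difference between the two linking rules can be sketched independently of any plotting library. The following is a minimal illustration in Python rather than the plotly or crosstalk API; the function names are hypothetical and the rows are copied from Table 2.1.

```python
# Rows copied from Table 2.1: (row id, species, sex, CL, RW).
CRABS = [
    (1, "B", "M", 16.1, 6.7),
    (2, "B", "M", 18.1, 7.7),
    (51, "B", "F", 14.7, 6.5),
    (52, "B", "F", 19.3, 8.5),
    (101, "O", "M", 16.7, 6.9),
    (102, "O", "M", 20.2, 8.2),
    (151, "O", "F", 21.4, 9.7),
    (152, "O", "F", 21.7, 9.2),
]
SPECIES, SEX = 1, 2  # tuple positions of the two categorical variables


def link_by_case(rows, brushed_ids):
    """One-to-one linking: highlight only the brushed observations."""
    return {row[0] for row in rows if row[0] in brushed_ids}


def link_by_variable(rows, brushed_ids, column):
    """Categorical linking: highlight every row sharing a level with a brushed row."""
    levels = {row[column] for row in rows if row[0] in brushed_ids}
    return {row[0] for row in rows if row[column] in levels}


by_case = link_by_case(CRABS, {151})
by_species = link_by_variable(CRABS, {151}, SPECIES)
print(sorted(by_case))     # [151]
print(sorted(by_species))  # [101, 102, 151, 152]
```

Brushing the same single crab therefore highlights one observation under the one-to-one rule, but every Orange crab when species is used as the linking variable.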

For web-based software, brushing on visuals representing aggregated data, such as the bar plot in Figure, is more challenging than linking by case (RStudio/Cheng, crosstalk website). Although not impossible, the user needs to explicitly communicate details of the aggregation. This becomes more challenging when continuous variables are aggregated, or mosaic plots involving several categorical variables are used. The web-based R package animint attempts to make the implementation of aggregate brushing easier, but it is currently limited to discrete values and does not incorporate the flexibility of brushing by case (Hock paper).

Linked brushing within a single plot can also be useful. A parallel coordinates plot (PCP) draws each variable as one of a set of parallel axes and each observation as a line connecting its values on those axes. Figure 2.3 applies m-to-n linking to identify the physical measurements of an individual crab in a PCP of the five real-valued variables. Brushing on one point initiates the linking of m nodes with n edges.

Figure 2.3: An example of m-to-n linking and tooltip identification. The five physical measurements of a male Blue crab are highlighted after brushing one of the nodes on the parallel coordinates plot’s axes.
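The m-to-n rule can be made concrete by treating each observation in a PCP as a polyline. The sketch below is illustrative Python, not the code behind Figure 2.3; the names `polyline` and `brush_node` are hypothetical, and the two rows are copied from Table 2.1.

```python
AXES = ["FL", "RW", "CL", "CW", "BD"]

# Two rows copied from Table 2.1, keyed by row id.
CRABS = {
    1: {"FL": 8.1, "RW": 6.7, "CL": 16.1, "CW": 19.0, "BD": 7.0},
    151: {"FL": 10.7, "RW": 9.7, "CL": 21.4, "CW": 24.0, "BD": 9.8},
}


def polyline(row):
    """Return the nodes (one per axis) and the edges joining adjacent axes."""
    nodes = [(axis, row[axis]) for axis in AXES]
    edges = list(zip(nodes, nodes[1:]))
    return nodes, edges


def brush_node(rows, axis, value):
    """Brushing one node highlights the whole polyline of each matching observation."""
    return {rid: polyline(row) for rid, row in rows.items() if row[axis] == value}


highlighted = brush_node(CRABS, "RW", 6.7)
nodes, edges = highlighted[1]
print(len(nodes), len(edges))  # 5 4
```

With five variables, brushing a single node on the RW axis links five nodes and the four edges between them.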

2.1.1.2 Identification

Identification using tooltips is also demonstrated in Figure 2.3. Labels containing prespecified variable values appear as the mouse “hovers” near graphical features representing these values. Being able to interactively and instantly identify values for outliers is particularly useful (Cook and Swayne 2007).

2.1.1.3 Line segments

The ease of brushing line segments affects the use of interactive data visualisation in network analysis. Cook and Swayne (2007) highlight the usefulness of being able to brush edges and nodes in network graphs to explore connections. Unlike GGobi, the web-based software tools examined were unable to brush lines. The linked brushing in Figure 2.3 was achieved by brushing a node on the RW variable’s axis. Similarly, the Mondrian software, which provides a GUI for interactive data visualisation, limits brushing on a PCP to the nodes (Mondrian website). Linked brushing on dendrograms, using the plotly function plot_dendro(), is also restricted to brushing the nodes (see demonstration on https://vimeo.com/189670650).

However, the usefulness of line segments to interactive data visualisation is not restricted to linked brushing. Line segments are essential to visualisations involving longitudinal data and assist in representing models (Cook and Swayne 2007). Wickham, Cook, and Hofmann (2015) demonstrate how grid lines representing clustering models generated from an algorithm allow the process to be evaluated when the models are viewed dynamically across iterations.

2.1.2 Subset selection

Subset selection involves using only a portion of the data for visualisation and/or analysis. This technique can help alleviate the computational strain and overplotting caused by large datasets (Cook and Swayne 2007). Figure demonstrates how the shiny software provides different interactive input controls for subsetting data before mapping data to graphical features and model fitting. The interactive slider filters the data using the continuous variable XX, while the radio buttons subset the data by the species of the crab. Using shiny allows for flexible subset selection, since its interactive visuals are connected to an active R session, which allows the analysis to be dynamically updated.
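The logic behind a range slider combined with radio buttons can be sketched as a filter followed by a re-run of the analysis. The example below is plain Python, not shiny code; carapace length (CL) is assumed as the slider variable purely for illustration, the function names are hypothetical, and the rows are copied from Table 2.1.

```python
from statistics import mean

# Rows copied from Table 2.1: (row id, species, CL, RW).
CRABS = [
    (1, "B", 16.1, 6.7),
    (2, "B", 18.1, 7.7),
    (51, "B", 14.7, 6.5),
    (52, "B", 19.3, 8.5),
    (101, "O", 16.7, 6.9),
    (102, "O", 20.2, 8.2),
    (151, "O", 21.4, 9.7),
    (152, "O", 21.7, 9.2),
]


def select_subset(rows, cl_range, species=None):
    """Keep rows whose CL lies in cl_range and, optionally, match the species."""
    low, high = cl_range
    return [
        row for row in rows
        if low <= row[2] <= high and (species is None or row[1] == species)
    ]


def summarise(rows):
    """A stand-in for the analysis that is re-run whenever the inputs change."""
    return {"n": len(rows), "mean_RW": round(mean(row[3] for row in rows), 2)}


blue = select_subset(CRABS, (16.0, 20.0), species="B")
print([row[0] for row in blue])  # [1, 2, 52]
print(summarise(blue))           # {'n': 3, 'mean_RW': 7.63}
```

In a reactive framework the equivalent of `summarise()` would be re-evaluated automatically each time the slider or radio button changes.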

2.1.3 Tours: A dynamic multivariate visual representation

Wickham et al. (2011) highlight tours as essential for gaining insight into the structures underlying real-valued multivariate data. A tour is an animation consisting of static low-dimensional projections of the high-dimensional data space. The projections to include in a tour can be manually determined, randomly chosen, or guided by algorithms, such as projection pursuit (Cook and Swayne 2007). The latter two options are referred to as grand tours and guided tours, respectively. Figure shows a guided tour of 2D projections of the five-dimensional space spanned by the real-valued variables of the crabs dataset. The “holes” projection pursuit index and geodesic interpolation between static projections, both provided by the R package tourr, were applied.
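Each frame of a tour is the data multiplied by an orthonormal projection basis. The sketch below, in plain Python, generates three randomly chosen 2D frames of the kind a grand tour moves between. It is only an illustration under simplifying assumptions: it does not reproduce the tourr package, the geodesic interpolation between frames or the “holes” index, and the variables would usually be rescaled before touring. The helper names are hypothetical and the rows come from Table 2.1.

```python
import math
import random

# Five measurements (FL, RW, CL, CW, BD) for the rows shown in Table 2.1.
CRABS = [
    [8.1, 6.7, 16.1, 19.0, 7.0],
    [8.8, 7.7, 18.1, 20.8, 7.4],
    [7.2, 6.5, 14.7, 17.1, 6.1],
    [9.0, 8.5, 19.3, 22.7, 7.7],
    [9.1, 6.9, 16.7, 18.6, 7.4],
    [10.2, 8.2, 20.2, 22.2, 9.0],
    [10.7, 9.7, 21.4, 24.0, 9.8],
    [11.4, 9.2, 21.7, 24.1, 9.7],
]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def random_basis(p, d, rng):
    """Draw d orthonormal directions in p dimensions using Gram-Schmidt."""
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for b in basis:
            coef = dot(v, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-12:  # skip the (unlikely) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


def project(rows, basis):
    """Project each p-dimensional row onto the d basis directions."""
    return [[dot(row, b) for b in basis] for row in rows]


rng = random.Random(2017)  # fixed seed so the frames are reproducible
bases = [random_basis(5, 2, rng) for _ in range(3)]
frames = [project(CRABS, basis) for basis in bases]
print(len(frames), len(frames[0]), len(frames[0][0]))  # 3 8 2
```

Pre-computing every frame in this way is what makes it possible to hand the whole tour to a web-based plot in one step, at the cost of not being able to change the tour path interactively.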

Cook and Swayne (2007) demonstrated how linked brushing between projections of a tour enabled purely graphical approaches towards supervised classification and cluster analysis. Although the animation required for tours can be achieved in shiny, this “spin and brush” technique cannot be facilitated because each projection is rendered independently. Instead, plotly and crosstalk were used to implement linked brushing between projections. The tourr package enables all data for the tour projections to be pre-generated before compilation as an interactive visual.

2.1.4 Scaling

The use of different scales in visuals can reveal different features of the data. TourdeFrance (??) demonstrates, using data from the 2013 Tour de France competition, how interactively changing axes settings allows a variety of comparisons to be made between different stages of the race, as well as the progress of the riders (see website as footnote).

Being able to zoom in or out of views of a plot also affects the scaling. This can be particularly useful for viewing busy regions of a plot (Cook and Swayne 2007). Figure XX demonstrates how the default settings for plotly objects allow the plot to be rescaled interactively according to a selected region.

2.2 Applications of interactive techniques

Other interactive techniques exist, but those discussed above were the most commonly used. Cook and Swayne (2007) demonstrated how interactive subset selection, different types of linked brushing and tours were useful in exploring the nature of missing values and the effects of imputation. Furthermore, these techniques were used to evaluate models from numerical methods, as well as to provide graphical approaches towards supervised classification and cluster analysis. Other interactive techniques, such as scaling, were incorporated at times, but the three interactive techniques of subset selection, linked brushing and tours were repeatedly utilised across a variety of problems. Similarly, Wickham, Cook, and Hofmann (2015) applied the three techniques to encourage the visualisation of statistical models in data space and explore the processes underlying “black box” methods. They demonstrated how linked brushing between visual representations of summary statistics and displays of models, such as parallel coordinate plots, allows subsets of models to be compared. Interactive subset selection helped to make the volume of data generated by algorithmic methods manageable and useful.

3 A survey of code-based software tools

The demonstrations of interactive techniques described by Cook and Swayne (2007) in Section XXX show that a code-based approach towards achieving the scope of techniques possible in GGobi requires the use of more than one software tool. This section compares the ease of application, development progress and coverage of interactive techniques for a range of code-based software tools. The comparison leads to the recommendation of the R packages plotly, crosstalk and shiny as the main software tools for introducing interactive techniques in teaching.

3.1 Types of interactive software

Lang and Swayne (2001) described two types of interactive data visualisation software: those with a direct manipulation graphical environment and those driven by a command-line interface. GGobi and Mondrian are examples of software programmes that have a graphical user interface (GUI) for direct manipulation. GUI environments provide immediate response to user actions, but without technical knowledge of the underlying low-level language, it is difficult to modify or extend these types of software (Lang and Swayne 2001). On the other hand, command-line interfaces, such as R, allow not only modification of existing capabilities but also the creation of new functions. Furthermore, the R environment provides access to a vast range of techniques and tools for statistical analysis (Unwin 2015).

R packages for interactive data visualisation aim to combine the benefits of direct manipulation, the extensibility of a scripting language and the statistical power of the R environment. The package rggobi connects the GGobi GUI with R, to enable complex interactive graphical analyses that would be difficult, if not impossible, to carry out without a command-line interface (Lawrence et al. 2009). Meanwhile, packages like plotly, crosstalk and shiny utilise the interactive graphics provided by the web, alongside the statistical functions accessible in R. The use of web browsers minimises the software requirements for utilising these packages. In contrast, rggobi depends on the installation of GGobi and the specific software required for building the GGobi GUI. Consequently, rggobi is less accessible than the other R packages. Unwin (2015) highlights that the shift towards web-based software helps to make sharing and presenting analyses from interactive data visualisations easier.

Other R packages for interactive data visualisation include trelliscopejs and animint. Like plotly, these packages build on the comprehensive graphing system provided by the R package ggplot2 (Wickham, 2009?). By adding interactive functions, trelliscopejs, animint and plotly convert static ggplot2 objects into web-based interactive plots. The trelliscopejs package focuses on adding interactive techniques to trellis plots. (Hafen ???) describes trelliscopejs as enabling “interactivity for free” and emphasises that adding interactivity should require little time and effort beyond that needed to create the static plot. The design behind animint reflects a similar sentiment but enables interactivity for a wider range of plot types. It enables linked brushing by introducing additional arguments to the mapping system used in ggplot2. However, this interactive technique can currently only be activated by conditioning on categorical or discrete variables (Hockings git). Furthermore, all calculations are pre-computed before plot compilation and hence genuine “real-time” analysis in response to interactivity is not possible. The plotly package shares this limitation, but it enables additional interactive techniques, such as zooming, and when paired with other software it is able to achieve more types of brushing. Furthermore, plotly appears to be further along in its development, since it is the only package out of the three that is currently available on the Comprehensive R Archive Network (CRAN). For the purposes of teaching, it may be challenging to introduce software that is still under development.

Instead of adding interactive components to ggplot2 objects, the ggvis package uses a similar syntax of its own, inspired by the grammar of graphics (ggvis web). The interactive features enabled by ggvis are similar to those available through shiny. For example, the input_slider() function in ggvis creates an interactive slider equivalent in functionality to shiny’s sliderInput(). Both software tools need to be connected to an R session in order for the interactive plots to be active. When applying interactive data visualisation for exploration and early stages of analysis, this is particularly useful because it allows the statistical analysis in R to respond directly to interactive inputs from the visuals (ggvis web). However, for the purpose of presentation, interactive graphics created using ggvis or shiny would need to be hosted on a server. In contrast, interactive plotly plots are easy to share as a standalone HTML file (Sievert git). In comparison to shiny, more aspects of ggvis are still under development and hence there may be significant changes to its current interfaces (ggvis web). Furthermore, shiny has the extra advantage of being compatible with a range of other software, such as plotly and crosstalk, whilst providing similar interactive functionality to ggvis.

Ease of application, development progress and coverage of interactive techniques were the main criteria used to identify plotly, crosstalk and shiny as an appropriate set of software tools for teaching interactive data visualisation.

Compare the effort

3.2 A set of software tools

In discussing the technical tools available to statisticians, John W Tukey (1965) highlighted: “Today, software and hardware together provide far more powerful factories than most statisticians realize, factories that many of today’s most able young people find exciting and worth learning about on their own.” (p. 25). Although Tukey’s comments in 1965 were in response to the potential impact of computers on statistical practices at the time, his comments are still relevant for examining the effects of the internet and web-based graphics on interactive data visualisation. The three packages of primary focus…

3.2.1 Coverage of interactive techniques

Discuss default settings for each (ease of use) and then compare the ease of the other techniques, in particular linked brushing.

The tables below summarise which of the interactive techniques described by Cook and Swayne (2007) are covered by the R packages researched so far. Linked brushing is summarised separately since there are a variety of ways to brush and link plots.

Table 3.1: Interactive techniques for Shiny, Plotly and Crosstalk.
Columns, in order: R Package; Talks back to R; Identification; Scaling; Subset selection; Drag points; Tours.
Shiny: Y; Y (input objects)
Plotly: Y (tooltips); Y; Y (legend icons); Y
Crosstalk & htmlwidgets: Y; Y; Y; Y (filtering); Y (D3); Y

Table: (#tab:brushing) Linked brushing techniques.

| R Packages | Spin and brush | Persistent brushing | Brush lines | Brush points (link 1-to-1) | Categorical brushing |
|------------|:--------------:|:-------------------:|:-----------:|:--------------------------:|:--------------------:|
| Plotly+Crosstalk | Y | Y | | Y | |
| Plotly+Shiny | | Y | | Y | Y |

The plotly package provides a starting point that enables the application of a variety of interactive techniques. The default settings for a plotly object allow for tooltip identification and scaling. Students have the option of converting plots in ggplot2 to plotly objects or directly creating plots using plot_ly(). The plot_ly() function takes a similar layered Grammar of Graphics (Wilkinson 2005) approach to ggplot2. Although the graphics of ggplot2 are more comprehensive, creating plots via plot_ly() should also be utilised since they interact more directly with the plotly.js library (Sievert 2017).

The tooltip argument for plotly objects allows for customisation of the information to display when hovering on a graphical feature. Brushing within a standalone plotly plot is possible on a PCP. Combining plotly with crosstalk introduces the key interactive technique of brushing to link multiple plots. This can also be achieved by using plotly and shiny together, but the SharedData environment of the crosstalk package allows brushing to be initiated in any of the linked plots by default. However, the crosstalk package is not appropriate for use with large data sets and currently supports only linked brushing and filtering views, for certain HTML widgets where observations are individually displayed (see https://rstudio.github.io/crosstalk/). Further research into how the htmlwidgets package can be used with crosstalk, with or without the reactive function in shiny, is still required. The use of D3 graphics with crosstalk will enable the interactive technique of dragging points.

4 The role of interactive visualisation in data analysis

The importance of visuals to exploratory data analysis was clearly highlighted by Tukey (1977) when he stated: “The greatest value of a picture is when it forces us to notice what we never expected to see” (p. vi). This section examines the use of interactive visualisation techniques in exploratory data analysis of the performance of New Zealand schools in the National Certificate of Educational Achievement (NCEA) in 2016. The role of interactive techniques in enhancing analysis and gaining more insight than static plots provide will be highlighted and demonstrated through worked examples.

Key findings with regards to the use of interactive techniques in data analysis, or the teaching of interactive data visualisation, will be noted explicitly as they are demonstrated using the NCEA data set.

4.1 The NCEA data set

Data on the achievement rates of schools across the four NCEA qualification levels, Level One, Two, Three (L1, L2, L3) and University Entrance (UE), were obtained from the New Zealand Qualifications Authority (NZQA) website. Students need to demonstrate sufficient mastery of standards at each respective NCEA level in order to be awarded the qualification. The UE qualification differs from L3, in that the standards need to be from “university endorsed” Level Three subjects, and specific requirements for literacy and numeracy must be met.

Information on the school decile, region and a “small” cohort warning was also provided in the data. The decile rating is a measure of the general income level of the families of students attending the school. The socio-economic status of students’ families rises as the decile increases from one to ten. A handful of schools have a decile rating of zero, due to unique circumstances that make them exempt from the socio-economic measure. The achievement rate of a school for each qualification level was quantified in a few ways. The achievement indicator chosen for this analysis was the proportion of students at the school who were successful in obtaining the qualification level, given that they were entered in enough standards to have the opportunity to earn the qualification in the 2016 school year. This is referred to as the “Current Year Achievement Rate” for the “Participating Cohort” by the NZQA (see http://www.nzqa.govt.nz/assets/Studying-in-NZ/Secondary-school-and-NCEA/stats-reports/NZQA-Secondary-Statistics-Consolidated-Data-Files-Short-Guide.pdf).
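The chosen indicator is a simple proportion, which also explains why a “small” cohort warning accompanies the data. The sketch below is illustrative Python with hypothetical counts; it is not NZQA's own calculation, and the function name is an assumption.

```python
def achievement_rate(achieved, participating):
    """Proportion of the participating cohort that gained the qualification."""
    if participating <= 0:
        raise ValueError("participating cohort must be positive")
    if not 0 <= achieved <= participating:
        raise ValueError("achieved must lie between 0 and participating")
    return achieved / participating


# Hypothetical counts, not taken from the NZQA data.
print(achievement_rate(45, 50))  # 0.9

# One additional successful student moves the rate by 1 / participating,
# so rates for small cohorts are far more volatile.
for cohort in (5, 50, 200):
    print(cohort, 1 / cohort)
```

In a cohort of five, a single student changes the rate by 20 percentage points, whereas in a cohort of 200 the change is half a percentage point.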

Only schools with achievement indicators across all four qualification levels were retained, thus reducing the data set to 408 schools from around New Zealand. The focus of analysis will be on the subset of 91 Auckland schools, but the New Zealand data set of 408 schools will be used to demonstrate how interactive techniques can aid graphical analysis when the number of observations increases. The sensitivity of analysis to sample size for the New Zealand schools data set will also be explored interactively.

The Auckland subset was chosen because it is less affected by the unreliability of small sample sizes. Auckland has many of the larger schools since it is the most populous city in New Zealand. The first few observations from the data set of 91 schools in the Auckland region are shown below.

##                              L1    L2    L3    UE Decile
## Al-Madinah School         0.889 1.000 1.000 0.640      2
## Albany Senior High School 0.905 0.904 0.882 0.701     10
## Alfriston College         0.659 0.651 0.563 0.369      2

4.2 Static visual analysis

A question that naturally arises from the NCEA data is whether a school’s performance is related to its decile rating. Using graphics in data analysis entails considering multiple views of the data, starting with simple low-dimensional representations before examining complex multivariate structures (Cook and Swayne 2007).

The pairs plot in Figure 4.1 shows that the achievement rates at L1, L2 and L3 have a weak relationship with decile rating, but the positive correlation is strong at UE. There appears to be an increasing “lower bound” to achievement rates for L1, L2 and L3 as decile increases, but there is a lot of scatter above this boundary. In the bivariate scatterplots we can also see the spread of achievement rates varying across decile groups. The variation in achievement rates decreases as the decile increases (from one), across the L1, L2 and L3 qualification levels. We can see many schools approaching the maximum 100% achievement rate for L1, L2 and L3, hence it is not surprising to see that their univariate distributions are skewed to the left in Figure 4.2. The distribution of achievement rates for UE is less skewed and hints at two possible groupings. Furthermore, performance across the qualification levels appears to be positively correlated, especially between L1 and L2. Hence the low-dimensional plots indicate non-normality, unequal spread between groups and multicollinearity.

Figure 4.1: Pairs plot for NCEA data on Auckland schools.

Figure 4.2: Univariate distributions for NCEA data on Auckland schools.

4.3 Multivariate visual representations

The pairs plot in Figure 4.1 provided a glimpse into the multivariate distribution of achievement rates across the four qualification levels. The parallel coordinates plot (PCP) shown below in Figure 4.3 allows us to further compare the multivariate distributions of achievement rates for different decile groups, as well as identify high dimensional clusters and outliers, if they exist.

The ordering of axes in a PCP greatly affects the quality of the graphical analysis, hence interactivity that enables reordering of axes is recommended (Unwin 2015). In the case of the NCEA data, the natural ordering of the four qualification levels by difficulty coincides with the recommendation from Cook and Swayne (2007) to order the axes based on correlation. In addition, Unwin (2015) highlights that the layering of colours also needs to be considered carefully, since the last group assigned a colour will dominate the other lines.

The positive relationships previously identified in the pairs plot should translate to approximately horizontal lines between the parallel axes in the PCP, as opposed to sloped or “criss-crossed” lines for negative correlation. The static plot in Figure 4.3 raises the question of whether the positive relationships hold true for Auckland schools with low achievement rates and in some decile groups. In particular, there appears to be a negative relationship between achievement rates at L3 and UE for many schools. The achievement rates for UE are clearly the most variable.
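The link between “criss-crossed” lines and negative association can be made precise: two lines cross between adjacent axes exactly when the two observations are ordered differently on the two variables, which is the same pair counting that underlies rank correlation measures such as Kendall's tau. The sketch below is illustrative Python using only the three Auckland schools listed in Section 4.1; the function name is hypothetical and tied values are not counted as crossings.

```python
from itertools import combinations

# Rows copied from the Auckland extract in Section 4.1: (L1, L2, L3, UE).
SCHOOLS = {
    "Al-Madinah School": (0.889, 1.000, 1.000, 0.640),
    "Albany Senior High School": (0.905, 0.904, 0.882, 0.701),
    "Alfriston College": (0.659, 0.651, 0.563, 0.369),
}
LEVELS = ("L1", "L2", "L3", "UE")


def crossings(rows, left, right):
    """Count pairs of lines that cross between two adjacent PCP axes."""
    i, j = LEVELS.index(left), LEVELS.index(right)
    return sum(
        1
        for a, b in combinations(rows, 2)
        if (a[i] - b[i]) * (a[j] - b[j]) < 0
    )


rows = list(SCHOOLS.values())
for left, right in zip(LEVELS, LEVELS[1:]):
    print(left, right, crossings(rows, left, right))
```

For these three schools there is one crossing between L1 and L2, none between L2 and L3, and one between L3 and UE. Three rows are far too few to draw conclusions from; the point is only that the count depends on the ordering of observations on each axis, not on how the axes are scaled.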

The higher decile schools in Auckland appear to dominate the high achievement rates across all qualification levels, while lower decile schools are less consistent with each other in terms of their performance across the levels. Although there are only 91 observations (lines), it is quite difficult to identify even “ball park boundaries”, on the 10-point decile scale, to distinguish between “higher” and “lower” decile schools when describing possible patterns.

Figure 4.3: Parallel coordinates plot for NCEA data on Auckland schools.

The following two plots demonstrate how alpha blending can help minimise the effects of overplotting as the number of observations increases. It is easier to check whether the patterns identified in Auckland schools extend to the 408 schools across New Zealand using Figure 4.5, where alpha blending is applied, than using Figure 4.4. The performance of high achieving lower decile schools is less “drowned out” by the dominance of their higher decile counterparts when alpha blending is used. Figure 4.5 reveals a group of schools converging at 100% achievement for L3, but with varying levels of success at L2 and UE. It would be of interest to compare the achievement rates of these schools across the qualification levels. Similarly, we would be interested in tracking the performance of the school with the lowest achievement rate at L1. The school appears to make a convincing recovery in performance at L2, but it is impossible to follow its progress further in a static PCP due to overplotting. Interactive techniques will be used later to explore these points of interest.
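Alpha blending helps because the opacity of overlapping lines accumulates gradually rather than saturating immediately. Assuming standard “over” compositing of identically coloured lines, n overlapping lines of opacity α produce a combined opacity of 1 − (1 − α)^n. The Python sketch below evaluates this; the value of α is hypothetical, since the setting used for Figure 4.5 is not stated.

```python
def stacked_opacity(alpha, n_lines):
    """Opacity where n_lines of equal alpha overlap: 1 - (1 - alpha)^n."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return 1.0 - (1.0 - alpha) ** n_lines


ALPHA = 0.2  # hypothetical value; the alpha used for Figure 4.5 is not stated
for n_lines in (1, 5, 20):
    print(n_lines, round(stacked_opacity(ALPHA, n_lines), 3))
```

With α = 0.2, a single line is faint (0.2), five overlapping lines are clearly visible (about 0.67) and twenty are nearly solid (about 0.99), so dense regions stand out while isolated lines remain traceable.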

Figure 4.4: Parallel coordinates plot for New Zealand schools, without alpha blending.

Figure 4.5: Parallel coordinates plot for New Zealand schools, with alpha blending.
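The effect of alpha blending on overplotting can be sketched in base R. The example below is only a minimal illustration: the `rates` data frame holds simulated values (hypothetical stand-ins, not the NCEA data), and the use of base graphics is an assumption, since the figures above may have been drawn with a different package.

```r
# Simulated stand-in for achievement rates at four qualification levels.
# All values are hypothetical; they only mimic the shape of the NCEA data.
set.seed(1)
n <- 400
ability <- rbeta(n, 8, 2)
noisy <- function(sd) pmin(1, pmax(0, ability + rnorm(n, 0, sd)))
rates <- data.frame(L1 = noisy(0.05), L2 = noisy(0.05),
                    L3 = noisy(0.08), UE = noisy(0.15))

# One semi-transparent colour per line: overlapping lines build up darker
# regions, so dense bundles stand out without hiding sparse ones.
line_cols <- rep(adjustcolor("steelblue", alpha.f = 0.2), n)

draw_pcp <- function(cols, title) {
  # matplot() draws one line per column, so transpose: one column per school.
  matplot(t(as.matrix(rates)), type = "l", lty = 1, col = cols,
          xaxt = "n", xlab = "Qualification level",
          ylab = "Achievement rate", main = title)
  axis(1, at = seq_along(rates), labels = names(rates))
}

op <- par(mfrow = c(1, 2))
draw_pcp("steelblue", "Without alpha blending")
draw_pcp(line_cols, "With alpha blending")
par(op)
```

The only difference between the two panels is the alpha channel of the line colour, which makes the contribution of transparency easy to isolate.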

4.4 Leveraging static plots with interactivity

Although one of the strengths of a PCP is in identifying multivariate features, such as outliers (Cook & Swayne, 2007), the static plots in Figures 4.4 and 4.5 do not allow these features to be explored, due to overplotting. The use of colour and interactive techniques is recommended for maximising the effectiveness of a PCP. Venables and Ripley (2002) argue parallel coordinate plots are “often too ‘busy’ without means of interaction” (p. 315). The interactive filtering feature in Figure 4.6 addresses the problem of overplotting by allowing the user to isolate the distribution of each decile group (via a double-click on the legend). Furthermore, using the mouse to drag-and-select nodes on the parallel axes brushes and links the achievement rates of individual schools across all qualification levels. Lastly, the hovering tooltip enables instant identification of individual schools by name.

Interactive filtering alleviates issues with overplotting when the number of observations increases.

Figure 4.6: Parallel coordinates plot with interactivity.
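A PCP with node brushing, legend filtering and tooltips of the kind described above could be assembled roughly as follows. This is a sketch under stated assumptions rather than the code behind Figure 4.6: the data are simulated, the object names (`wide`, `long`, `shared`, `pcp`) are hypothetical, and building the plot with ggplot2 before converting it with `ggplotly()` is one possible route among several.

```r
library(plotly)  # also attaches ggplot2

# Hypothetical wide data: one row per school (simulated values).
set.seed(2)
n <- 40
lv <- c("L1", "L2", "L3", "UE")
wide <- data.frame(
  School = paste("School", seq_len(n)),
  Decile = sample(1:10, n, replace = TRUE),
  matrix(round(runif(n * 4, 0.4, 1), 3), ncol = 4,
         dimnames = list(NULL, lv))
)

# Reshape to long form: one row per school and qualification level.
long <- data.frame(
  School = rep(wide$School, times = length(lv)),
  Decile = factor(rep(wide$Decile, times = length(lv)), levels = 1:10),
  Level  = factor(rep(lv, each = n), levels = lv),
  Rate   = unlist(wide[lv], use.names = FALSE)
)

# Key the data by school so that brushing one node selects the whole line.
shared <- highlight_key(long, ~School)

p <- ggplot(shared, aes(Level, Rate, group = School, colour = Decile,
                        text = School)) +
  geom_line(alpha = 0.4) +
  geom_point(alpha = 0.4)

# tooltip = "text" shows the school name on hover; the legend created by
# the colour aesthetic can be double-clicked to isolate one decile group.
pcp <- highlight(ggplotly(p, tooltip = c("text", "y")),
                 on = "plotly_selected", off = "plotly_deselect")
# Print `pcp` in an interactive session to view the widget.
```

Keying the shared data by school is the design choice that matters here: a selection made on any one node is then propagated to every row with the same key, which is what highlights the full line across the axes.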

Figure 4.7 demonstrates the use of interactive techniques to explore the outlier at L1, previously identified in Figure 4.5. The tooltip identifies the school and the linked brushing reveals that despite its poor performance at L1, the school had 100% achievement rates at L2 and L3.

Figure 4.7: Linked brushing to examine outliers.

Cook and Swayne (2007) describe how linked brushing enables dynamic database querying and direct comparisons between subsets of data and the general distribution. Figure 4.8 illustrates how brushing the point where L3=100% on the interactive PCP highlights the extent of the variability in performance at the other qualification levels. Surprisingly, the distribution of UE achievement rates for schools with 100% achievement at L3 is just as spread out as the overall performance of New Zealand schools in the data set. Many of these schools also obtained 100% achievement rates at L2, but again there is surprisingly variable success at L1.

Figure 4.8: Linked brushing to subset groups of interest.

The query made via linked brushing in Figure 4.8 is similar to subsetting the data using the code shown below. A summary could be used to examine the performance of the subset of 42 schools across the other qualifications, but making sense of this summary would require comparison with statistics for the whole data set. Furthermore, the summary hides the unusual performance of individual schools at certain qualification levels. The interactive techniques allowed us to gain insights about the NCEA data that would have been difficult to ascertain from static plots or numeric summaries alone.

Linked brushing and tooltip identification quickly reveal further insight into patterns underlying unusual groups or individuals.

L3all_achieve <- nzqa[nzqa$L3==1, c("L1", "L2", "UE")]
nrow(L3all_achieve)
## [1] 42
summary(L3all_achieve)
##        L1               L2               UE        
##  Min.   :0.1430   Min.   :0.5710   Min.   :0.1540  
##  1st Qu.:0.8407   1st Qu.:0.9073   1st Qu.:0.5000  
##  Median :0.9260   Median :1.0000   Median :0.7700  
##  Mean   :0.8848   Mean   :0.9463   Mean   :0.7025  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.9380  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
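The comparison with the whole data set that the numeric summary calls for can be sketched as below. The data frame `nzqa_sim` is a simulated, hypothetical stand-in for `nzqa`, so the numbers it produces say nothing about the real schools; the point is only the side-by-side layout of subset and overall quartiles.

```r
# Simulated stand-in for the nzqa data frame (hypothetical values).
set.seed(42)
n <- 408
sim_rate <- function(mean, sd) round(pmin(1, pmax(0, rnorm(n, mean, sd))), 3)
nzqa_sim <- data.frame(L1 = sim_rate(0.85, 0.12), L2 = sim_rate(0.88, 0.10),
                       L3 = sim_rate(0.82, 0.15), UE = sim_rate(0.60, 0.20))

others <- c("L1", "L2", "UE")
top_L3 <- nzqa_sim$L3 == 1   # same condition as the subset above

# Quartiles of the other qualification levels for a given set of rows.
quartiles <- function(d) {
  sapply(d[others], quantile, probs = c(0.25, 0.5, 0.75))
}

# Stack the brushed subset on top of the figures for all schools.
comparison <- rbind(
  data.frame(group = "L3 = 100%", stat = c("Q1", "Median", "Q3"),
             quartiles(nzqa_sim[top_L3, ]), row.names = NULL),
  data.frame(group = "All schools", stat = c("Q1", "Median", "Q3"),
             quartiles(nzqa_sim), row.names = NULL)
)
print(comparison, row.names = FALSE)
```

Even with both sets of quartiles in one table, individual schools with unusual profiles remain invisible, which is the limitation that linked brushing addresses.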

However, the ease and effectiveness of linked brushing can also be affected by overplotting. Figure 4.9 demonstrates how an attempt to use persistent brushing to identify the UE success rate of the school with 100% achievement at L3, but unusually low performance at L2, resulted in accidentally brushing an observation with a similar L2 achievement rate. The plotly R package used in Figure 4.6 allows only the nodes of the PCP to be brushed. The ggobi GUI applies the same drag-and-select motion to brush both points and lines on a PCP (Cook & Swayne, 2007). This greater flexibility would have avoided the issues caused by overplotting in this case, but some of the challenges of large data sets for static graphical displays will persist, even with interactive techniques. In particular, the speed of rendering plots for large data sets determines whether the techniques can be applied “fast enough to be considered interactive” (Unwin 2015, p. 20).

The effectiveness of interactive techniques for large data sets will be affected by rendering delays and limitations imposed by plot size and resolution.

Figure 4.9: Accuracy of brushing is affected by overplotting.

The PCP in Figure 4.3 suggests that patterns of performance across decile groups, if they exist, are difficult to distinguish at the multivariate level for Auckland schools. Principal component analysis (PCA) provides a way to reveal interesting multivariate structure, through finding projections of the data that show maximal variability (Venables & Ripley, 2002).

A plot of the first two principal components of the Auckland schools data set is shown in Figure 4.10. The decile of the schools is represented by the colour and plotting symbol. The axes for the original variables, reflecting the loadings of the principal components, indicate that the first principal component reflects the schools’ general performance across all four qualification levels, while the second principal component contrasts performance in UE against the remaining qualifications. Not surprisingly the plot shows more spread across the first principal component since it explains a much greater proportion of the variation in the data. The first principal component reveals a division between the majority of schools and a smaller group, that is positioned away from the variable axes shown. We can see the smaller group of schools are decile five and below, except for one decile nine school. There is also a decile ten school that appears to be unusual when examining both principal components. The use of colour highlights the remaining decile nine and ten schools as similar in performance, as weighted by the principal components. On the other hand, schools from the other deciles seem to be more spread out from each other in the PCA plot.

Cook and Swayne (2007, p. 70) note that colours become difficult to distinguish when a variable has many levels.

Figure 4.10: First two principal components for Auckland schools
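The quantities discussed above (proportion of variance, loadings and scores) can be obtained with `prcomp()`. The sketch below uses simulated rates for 91 hypothetical schools, and scaling the variables to unit variance is an assumption, since the text does not say whether the original PCA was scaled.

```r
# Simulated stand-in for the Auckland schools (hypothetical values).
set.seed(10)
n <- 91
decile <- sample(1:10, n, replace = TRUE)
overall <- 0.55 + 0.04 * decile + rnorm(n, 0, 0.08)
clip <- function(x) pmin(1, pmax(0, x))
rates <- data.frame(L1 = clip(overall + rnorm(n, 0, 0.05)),
                    L2 = clip(overall + rnorm(n, 0, 0.05)),
                    L3 = clip(overall + rnorm(n, 0, 0.07)),
                    UE = clip(overall - 0.2 + rnorm(n, 0, 0.12)))

# Scaling to unit variance is an assumption; the source does not state it.
pca <- prcomp(rates, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 3))

# Loadings: PC1 weights of the same sign suggest an "overall performance"
# axis, while mixed signs on PC2 indicate a contrast between variables.
print(round(pca$rotation[, 1:2], 2))

# Scores on the first two components, coloured by decile.
pal <- hcl.colors(10, "Viridis")
plot(pca$x[, 1:2], col = pal[decile], pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = 1:10, col = pal, pch = 19,
       title = "Decile", cex = 0.7)
```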

The principal components plot in Figure 4.10 provided a view of the multivariate distribution of the Auckland schools data set that was less easily affected by overplotting than a static PCP. Hence the earlier observations could be refined. From the parallel coordinate plots we observed that “higher” decile schools performed more consistently with each other and dominated across the four qualification levels, but it was difficult to quantify the decile “cut off” for such schools. The PCA suggests these schools are the decile nine and ten Auckland schools, with the exception of two schools. Again, interactive tooltips enable instant identification of these two unusual observations in Figure 4.11.

Furthermore, the two plots in Figure 4.11 are interactively linked by brushing. Individual points or groups of points in the PCA plot can be brushed via a drag-and-select motion using the mouse. The “Lasso Select” option can be activated from the hovering menu in the top right-hand corner if the region of points to be selected is irregular rather than rectangular. The original achievement rates of the selected schools will be highlighted in the PCP, while the lines for the unselected schools will be dimmed. Similarly, brushing on the nodes of the PCP highlights the corresponding observations on the PCA plot.

Figure 4.11: First two principal components for Auckland schools with interactivity.
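One way to link a PCA scatterplot and a PCP so that a selection in either highlights the other is sketched below with plotly. The data are simulated, and the object names and the group label `"schools"` are hypothetical; the code behind Figure 4.11 may be organised differently.

```r
library(plotly)

# Simulated stand-in data (hypothetical values and school names).
set.seed(3)
n <- 30
lv <- c("L1", "L2", "L3", "UE")
rates <- matrix(runif(n * 4, 0.4, 1), ncol = 4, dimnames = list(NULL, lv))
schools <- paste("School", seq_len(n))

pca <- prcomp(rates, scale. = TRUE)
scores <- data.frame(School = schools, PC1 = pca$x[, 1], PC2 = pca$x[, 2])
long <- data.frame(School = rep(schools, times = length(lv)),
                   Level  = factor(rep(lv, each = n), levels = lv),
                   Rate   = as.vector(rates))

# Two views of the same schools: the shared key (School) and the shared
# group name are what link a selection in one plot to the other.
sd_scores <- highlight_key(scores, ~School, group = "schools")
sd_long   <- highlight_key(long, ~School, group = "schools")

pca_plot <- plot_ly(sd_scores, x = ~PC1, y = ~PC2, text = ~School,
                    hoverinfo = "text", type = "scatter", mode = "markers")

# group_by() keeps each school as its own line within a single trace.
pcp_plot <- plot_ly(sd_long, x = ~Level, y = ~Rate, text = ~School,
                    hoverinfo = "text") %>%
  group_by(School) %>%
  add_trace(type = "scatter", mode = "lines+markers")

# dragmode = "lasso" makes the lasso brush the default selection tool.
linked <- subplot(pca_plot, pcp_plot, titleX = TRUE, titleY = TRUE) %>%
  layout(dragmode = "lasso", showlegend = FALSE) %>%
  highlight(on = "plotly_selected", off = "plotly_deselect")
# Print `linked` in an interactive session to view the widget.
```

The PCP trace includes markers as well as lines because, as noted earlier, plotly brushes the nodes rather than the line segments.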

Figure 4.12 demonstrates the use of the lasso brush to highlight the original variable values of the central group of points in the PCA plot. The linked brushing and tooltip identification reveal that these Auckland schools have achievement rates of at least 80% at every qualification level except UE. We are also able to identify the only decile one school in this group.

Figure 4.12: Brushing a central group of points in the PCA plot links with the decile and achievement rates of the selected schools.

In a similar manner to Figure 4.9, persistent brushing can be applied with different colours to further explore the multivariate features revealed by the PCA. The three outliers brushed in the PCA plot are at two extremes. Comparing their achievement rates on the PCP helps to explain why they are unusual in different ways. Two of the outliers had consistently high performance at the first three qualification levels, but there was a drastic drop in the UE achievement rate. Although, as previously observed, lower performance at UE was shared by the Auckland schools in general, the large differences between the two schools’ achievement rates for UE and the remaining qualification levels are still unexpected. The single outlier at the other extreme of the PCA plot defies the general trend. For this school it was the L1 achievement rate, rather than the UE rate, that was unusually low relative to the other qualification levels. Although all three schools had unusual performance compared to Auckland schools in general, the inconsistencies in their performance differed. The linked brushing enabled us to further investigate the differences in “unusualness” and it also highlights how the multivariate outliers identified in the PCA are hidden in the PCP.

Linked brushing between plots allows different visual representations of the same data to be related together, so that abstract methods can be explored further to reveal more insights about the multivariate features.

Figure 4.13: Brushing outliers on the PCA plot identifies inconsistent patterns of achievement on the PCP.

Initiating the linked brushing from the PCP is also useful. Figure 4.14 confirms the outliers visible in the PCP are also identified as outliers in the PCA. The PCP highlights schools that have unusually low achievement rates at a particular qualification level, while the outliers from the PCA encompass these schools, as well as those that are unusual in terms of patterns of performance. (Comment on use of different multivariate representations in analysis, Unwin).

Figure 4.14: Brushing outliers on the PCP confirms their status as also outliers in the PCA plot.

The NZQA indicator of a “small” cohort used a very low threshold of fewer than five candidates at any qualification level. Hence information on the number of Year 11, 12 and 13 students at each school in 2016 was sourced from the New Zealand government website, Education Counts (2017). The minimum cohort size across the three senior year levels will be used as an indicator of whether the NCEA achievement rates were based on small sample sizes. The website also provided information on the ethnic composition of the New Zealand schools in terms of six general categories: Maori, Pasifika, Asian, Middle Eastern/Latin American/African (MELAA), Other and European/Pakeha. Hence the merged data set contained information on the NCEA achievement rates, region, type and total roll size of each New Zealand school, as well as the proportions of students in the six ethnic categories and a minimum cohort size based on the number of students at the senior year levels. The first few observations of the merged data set are shown below.

##                      School    L1    L2    L3    UE Decile
## 1        Akaroa Area School 1.000 0.800 1.000 1.000      8
## 2         Al-Madinah School 0.889 1.000 1.000 0.640      2
## 3 Albany Senior High School 0.905 0.904 0.882 0.701     10
##                    Region                  Type Total Maori Pasifika Asian
## 1 Christchurch/Canterbury Composite (Year 1-15)   144  0.20     0.00  0.01
## 2                Auckland Composite (Year 1-15)   530  0.00     0.00  0.93
## 3                Auckland Secondary (Year 9-15)   759  0.08     0.01  0.19
##   MELAA Other EuropeanPakeha Year.11 Year.12 Year.13 Min.Cohort
## 1  0.03  0.00           0.76      15       9       6          6
## 2  0.06  0.01           0.00      27      30      18         18
## 3  0.08  0.02           0.62     252     261     246        246
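The merge and the minimum cohort calculation could be carried out along the following lines. The sketch uses only the three schools printed above; the object names `ncea` and `rolls`, and the choice to join on school name, are assumptions made for illustration.

```r
# Toy versions of the two sources, using the three schools printed above.
# The object names `ncea` and `rolls` are hypothetical.
ncea <- data.frame(
  School = c("Akaroa Area School", "Al-Madinah School",
             "Albany Senior High School"),
  L1 = c(1.000, 0.889, 0.905), L2 = c(0.800, 1.000, 0.904),
  L3 = c(1.000, 1.000, 0.882), UE = c(1.000, 0.640, 0.701),
  Decile = c(8, 2, 10)
)
rolls <- data.frame(
  School = c("Albany Senior High School", "Akaroa Area School",
             "Al-Madinah School"),
  Year.11 = c(252, 15, 27), Year.12 = c(261, 9, 30),
  Year.13 = c(246, 6, 18)
)

# Join on school name; rows need not be in the same order in both sources.
merged <- merge(ncea, rolls, by = "School")

# Smallest of the three senior cohorts, computed row by row with pmin().
merged$Min.Cohort <- pmin(merged$Year.11, merged$Year.12, merged$Year.13)
print(merged[, c("School", "Year.11", "Year.12", "Year.13", "Min.Cohort")])
```

`pmin()` is used rather than `min()` because it works element-wise across the three columns, giving one minimum per school instead of a single overall minimum.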

Figure 4.15 allows the effect of sample size on the PCA to be explored via an interactive slider that filters schools according to their minimum cohort size. Selecting the “play” icon on the slider animates the filter, so the effect of varying the minimum cohort size can be viewed dynamically.

Figure 4.15: Interactivity to explore the effect of sample size. See https://shanl33.shinyapps.io/ncea/.

The pattern previously noted for Auckland schools, where decile nine and ten schools were more consistent with each other in performance than the other deciles, appears to hold for New Zealand schools in general, as we vary the minimum cohort size of schools considered in the PCA. Not surprisingly, as the minimum cohort size increases, the spread in the PCA plot decreases and the first principal component is able to explain more of the variation in the data. The loadings, represented by the plot axes, remain reasonably consistent. The first principal component generally measures overall performance across the four qualification levels, while the second component contrasts L1 and L2 performance against L3 and UE.
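What the slider computes at each step can be imitated with a simple loop over thresholds. The sketch below is an illustration on simulated data in which smaller cohorts are deliberately given noisier rates, so any pattern in its output is built in by construction and is not evidence about the real schools; all object names are hypothetical.

```r
# Simulated stand-in: smaller cohorts get noisier rates by construction.
set.seed(7)
n <- 408
min_cohort <- sample(5:300, n, replace = TRUE)
noise_sd <- 0.5 / sqrt(min_cohort)
overall <- rnorm(n, 0.8, 0.08)
clip <- function(x) pmin(1, pmax(0, x))
rates <- data.frame(L1 = clip(overall + rnorm(n, 0, noise_sd)),
                    L2 = clip(overall + rnorm(n, 0, noise_sd)),
                    L3 = clip(overall + rnorm(n, 0, noise_sd)),
                    UE = clip(overall - 0.2 + rnorm(n, 0, noise_sd)))

# Refit the PCA on the schools that pass each minimum cohort threshold
# and record the share of variance explained by the first component.
thresholds <- c(5, 20, 50, 100)
pc1_share <- sapply(thresholds, function(k) {
  keep <- min_cohort >= k
  pca <- prcomp(rates[keep, ], scale. = TRUE)
  pca$sdev[1]^2 / sum(pca$sdev^2)
})

print(data.frame(
  min_cohort = thresholds,
  schools    = sapply(thresholds, function(k) sum(min_cohort >= k)),
  pc1_share  = round(pc1_share, 3)
))
```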

Dynamic visualisation allows multiple analyses to be computed and compared quickly. The flexibility of being able to pause animations allows closer examination of details, as the need arises.

The interactive drop-down menu also allows us to quickly verify whether there are any underlying relationships between achievement rates and demographics other than decile, that are worth further investigation. Although the other demographic variables did not reveal any further insight, the ease with which these additional variables could be interactively linked to the PCA plot, justified their inclusion in the exploratory data analysis. Enabling the interactive drop-down menu in Figure 4.15 involved adding the following single command to the code.

selectInput("group", "Select variable", choices=as.list(colnames(nzqa.sch)[6:15]), selected="Decile")
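To show where such a `selectInput()` call sits, the following is a minimal shiny sketch that pairs a drop-down menu with an animated slider. It is not the application behind Figure 4.15: the data frame `sim` is simulated, the input id `"cohort"` and the use of a base graphics plot are assumptions, and only two grouping variables are offered.

```r
library(shiny)

# Simulated stand-in for nzqa.sch (hypothetical values).
set.seed(5)
n <- 200
lv <- c("L1", "L2", "L3", "UE")
sim <- data.frame(
  matrix(runif(n * 4, 0.4, 1), ncol = 4, dimnames = list(NULL, lv)),
  Decile = sample(1:10, n, replace = TRUE),
  Region = sample(c("Auckland", "Wellington", "Other"), n, replace = TRUE),
  Min.Cohort = sample(5:300, n, replace = TRUE)
)

ui <- fluidPage(
  # animate = TRUE adds the "play" button that steps through the slider.
  sliderInput("cohort", "Minimum cohort size", min = 5, max = 100,
              value = 5, step = 5, animate = TRUE),
  selectInput("group", "Select variable",
              choices = c("Decile", "Region"), selected = "Decile"),
  plotOutput("pca")
)

server <- function(input, output) {
  output$pca <- renderPlot({
    # Re-run whenever the slider or the drop-down menu changes.
    keep <- sim[sim$Min.Cohort >= input$cohort, ]
    pca <- prcomp(keep[lv], scale. = TRUE)
    # input$group holds the column name chosen in the drop-down menu.
    grp <- factor(keep[[input$group]])
    pal <- hcl.colors(nlevels(grp), "Dark 3")
    plot(pca$x[, 1:2], col = pal[as.integer(grp)], pch = 19,
         xlab = "PC1", ylab = "PC2")
    legend("topright", legend = levels(grp), col = pal, pch = 19,
           title = input$group, cex = 0.7)
  })
}

app <- shinyApp(ui, server)
# runApp(app) launches the application in an interactive session.
```

Because the plot is rendered inside a reactive expression, no extra code is needed to connect the menu to the plot; reading `input$group` is enough for shiny to redraw when the selection changes.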

When interactive techniques can be applied with ease, the trade-off between the effort required to implement the technique and the level of insight gained is generally favourable.

** In addition, or not, could using min cohort size to explore standardised PCP ** Explore whether conclusions about above and below average schools across deciles hold (eg. the spread in decile 1 schools due to small sizes? Decile 5 really the “average school”?)

## # A tibble: 6 x 7
##                           School         L1          L2         L3
##                           <fctr>      <dbl>       <dbl>      <dbl>
## 1             Akaroa Area School  1.2040643 -1.17972038  1.3859252
## 2              Al-Madinah School  0.2576809  1.18511081  1.3859252
## 3      Albany Senior High School  0.3940965  0.04999184  0.4120941
## 4              Alfriston College -1.7032936 -2.94151962 -2.2205508
## 5              Amuri Area School -1.4219364  0.73579288 -0.2646359
## 6 Ao Tawhiti Unlimited Discovery -1.0894233 -2.07835624 -0.8175739
## # ... with 3 more variables: UE <dbl>, Decile <fctr>, Region <fctr>
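A common way to produce standardised values of this kind is column-wise z-scoring with `scale()`. Whether the original analysis used this exact call is an assumption. The sketch below applies it to the three schools printed earlier in this section, so its results will differ from the values above, which were presumably standardised over the full data set.

```r
# Rates for the first three schools, as printed earlier in this section.
x <- data.frame(
  School = c("Akaroa Area School", "Al-Madinah School",
             "Albany Senior High School"),
  L1 = c(1.000, 0.889, 0.905),
  L2 = c(0.800, 1.000, 0.904),
  L3 = c(1.000, 1.000, 0.882)
)

cols <- c("L1", "L2", "L3")

# scale() subtracts each column's mean and divides by its standard
# deviation, so every standardised column has mean 0 and sd 1.
z <- cbind(x["School"], as.data.frame(scale(x[cols])))
print(z)
```

Standardising puts the qualification levels on a common scale, so a school's position relative to the average can be compared across axes in a PCP.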

The insights into multidimensional data structures gained from individual static plots were leveraged and further explored when interactive techniques were applied.

Cook, Dianne, and Deborah F. Swayne. 2007. Interactive and Dynamic Graphics for Data Analysis with R and GGobi. 1st ed. Springer Publishing Company, Incorporated.

Lang, Duncan Temple, and Deborah F. Swayne. 2001. “GGobi Meets R: An Extensible Environment for Interactive Dynamic Data Visualization.” In Proceedings of DSC, 2.

Lawrence, Michael, Hadley Wickham, Dianne Cook, Heike Hofmann, and Deborah F. Swayne. 2009. “Extending the GGobi Pipeline from R.” Computational Statistics 24 (2): 195–205. doi:10.1007/s00180-008-0115-y.

Tukey, John W. 1965. “The Technical Tools of Statistics.” The American Statistician 19 (2). Taylor & Francis: 23–28.

Tukey, John W. 1977. Exploratory Data Analysis. Addison-Wesley.

Unwin, Antony. 2015. Graphical Data Analysis with R. Vol. 27. CRC Press.

Wickham, Hadley, Dianne Cook, and Heike Hofmann. 2015. “Visualizing Statistical Models: Removing the Blindfold.” Statistical Analysis and Data Mining: The ASA Data Science Journal 8 (4). Wiley Subscription Services, Inc., A Wiley Company: 203–25. doi:10.1002/sam.11271.

Wickham, Hadley, Dianne Cook, Heike Hofmann, Andreas Buja, and others. 2011. “tourr: An R Package for Exploring Multivariate Data with Projections.” Journal of Statistical Software 40 (2). Foundation for Open Access Statistics: 1–18.